71 research outputs found

The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies

Authors: Hengchen, Simon; Tahmasebi, Nina
Publication date: 01/01/2019

This paper is an overview of the opportunities and challenges of using large-scale text mining to answer research questions that stem from the humanities in general and literature specifically. In this paper, we will discuss a data-intensive research methodology and how different views of digital text affect answers to research questions. We will discuss results derived from text mining, how these results can be evaluated, and their relation to hypotheses and research questions. Finally, we will discuss some pitfalls of computational literary analysis and give some pointers as to how these can be avoided.

Peer reviewed

Publikationer från Uppsala Universitet

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Helsingin yliopiston digitaalinen arkisto

Text Mining for User Query Analysis: A 5-Step Method for Cultural Heritage Institutions

Authors: Chardonnens, Anne; Hengchen, Simon
Publication venue: Werner Hülsbusch
Publication date: 01/01/2017

Peer reviewed

DI-fusion

Helsingin yliopiston digitaalinen arkisto

Semantic Enrichment of a Multilingual Archive with Linked Open Data

Authors: De Wilde, Max; Hengchen, Simon
Publication date: 01/01/2017

This paper introduces MERCKX, a Multilingual Entity/Resource Combiner & Knowledge eXtractor. A case study involving the semantic enrichment of a multilingual archive is presented with the aim of assessing the relevance of natural language processing techniques such as named-entity recognition and entity linking for cultural heritage material. In order to improve the indexing of historical collections, we map entities to the Linked Open Data cloud using a language-independent method. Our evaluation shows that MERCKX outperforms similar tools on the task of place disambiguation and linking, achieving over 80% precision despite lower recall scores. These results are encouraging for small and medium-size cultural institutions since they demonstrate that semantic enrichment can be achieved with limited resources.

Peer reviewed

DI-fusion

Helsingin yliopiston digitaalinen arkisto

Exploring archives with probabilistic models: topic modelling for the valorisation of digitised archives of the European Commission

Authors: Coeckelbergs, Mathias; Hengchen, Simon; Steiner, Thomas; van Hooland, Seth; Verborgh, Ruben
Publication date: 01/01/2016

Ghent University Academic Bibliography

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Authors: Duong, Quan; Hengchen, Simon; Hämäläinen, Mika
Publication venue: Linköping University Electronic Press
Publication date: 03/11/2020

Peer reviewed

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

The omnipresence of the nation

Authors: Hengchen, Simon; Marjanen, Jani; Ros, Ruben; Tolonen, Mikko
Publication date: 01/03/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Dataset for Temporal Analysis of English-French Cognates

Authors: Coustaty, Mickaël; Doucet, Antoine; Frossard, Esteban; Hengchen, Simon; Jatowt, Adam
Publication venue: European Language Resources Association (ELRA)
Publication date: 13/05/2020

Languages change over time and, thanks to the abundance of digital corpora, their evolutionary analysis using computational techniques has recently gained much research attention. In this paper, we focus on creating a dataset to support investigating the similarity in evolution between different languages. We look in particular into the similarities and differences between the use of corresponding words across time in English and French, two languages from different linguistic families yet with shared syntax and close contact. For this we select a set of cognates in both languages and study their frequency changes and correlations over time. We propose a new dataset for computational approaches of synchronized diachronic investigation of language pairs, and subsequently show novel findings stemming from the cognate-focused diachronic comparison of the two chosen languages. To the best of our knowledge, the present study is the first in the literature to use computational approaches and large data to make a cross-language diachronic analysis.

Peer reviewed

Helsingin yliopiston digitaalinen arkisto